Our project focuses on obesity worldwide. We analyze the topic using two data sets: the first records the share of deaths related to obesity across the world, and the second lets us dig into the key factors that may lead to obesity.
We hope to tell a story and encourage people to adopt a healthier lifestyle to prevent obesity.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import warnings
warnings.simplefilter("ignore")
from pandas.api.types import is_string_dtype, is_numeric_dtype
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE
This data set shows the trend in the share of deaths related to obesity worldwide, and serves as a reminder that obesity is not far from our daily lives.
The data set contains each country's share of deaths related to obesity from 1990 to 2017. The data set information is as follows:\ Entity: Country name\ CODE: Country code\ Year: From 1990 to 2017\ Obesity (IHME, 2019): Share of deaths related to obesity in total deaths, Numerical (0, 1)
The data set could also be found here
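The worldwide trend described above can be sketched as a per-year average over all entities. This is a minimal illustration, not the notebook's own analysis: the `toy` frame and its values are made up, the helper `yearly_mean_share` is hypothetical, and only the `Year` and `Obesity (IHME, 2019)` column names are taken from the data set.

```python
import pandas as pd

# Made-up rows that only mimic the column layout of the real file.
toy = pd.DataFrame({
    "Entity": ["A", "A", "B", "B"],
    "Year": [1990, 2017, 1990, 2017],
    "Obesity (IHME, 2019)": [0.02, 0.04, 0.06, 0.10],
})

def yearly_mean_share(frame, value_col="Obesity (IHME, 2019)"):
    """Average the share across all entities for each year."""
    return frame.groupby("Year")[value_col].mean()

trend = yearly_mean_share(toy)
print(trend)
```

Applied to the real data, the same call would give one value per year, which is the quantity a line plot of `Year` against the share summarizes.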
ob = pd.read_csv('/Users/bryton/Desktop/ObesityDataSet/share-of-deaths-obesity.csv')
ob.shape
ob.info()
ob.head()
ob.describe()
sns.lineplot(data = ob, x = 'Year', y = 'Obesity (IHME, 2019)')
sns.histplot(data = ob, x = 'Obesity (IHME, 2019)', kde = True)
sns.boxplot(data = ob, x = 'Obesity (IHME, 2019)')
The Obesity dataset contains data for the estimation of obesity levels in individuals from the countries of Mexico, Peru and Colombia, based on their eating habits and physical condition.
Dataset information:\ The attributes of basic information:\ Gender: Categorical ('Male' , 'Female')\ Age: Numerical\ Height: Numerical\ Weight: Numerical\ family_history_with_overweight: Categorical ('Yes', 'No')
The attributes related to eating habits:\ FAVC: Frequent consumption of high caloric food, Categorical ('Yes', 'No')\ FCVC: Frequency of consumption of vegetables, Numerical\ NCP: Number of main meals, Numerical\ SMOKE: Categorical ('Yes' , 'No')\ CAEC: Consumption of food between meals, Categorical ('Sometimes', 'Frequently', 'Always', 'No')\ CH2O: Consumption of water daily, Numerical\ CALC: Consumption of alcohol, Categorical ('No', 'Sometimes', 'Frequently', 'Always')
The attributes related to physical condition: \ SCC: Calories consumption monitoring, Categorical ('No', 'Yes')\ FAF: Physical activity frequency, Numerical\ TUE: Time using technology devices, Numerical\ MTRANS: Transportation used, Categorical ('Public_Transportation', 'Walking', 'Automobile', 'Motorbike', 'Bike')
Target:\ NObeyesdad: Categorical ('Insufficient Weight', 'Normal Weight', 'Overweight Level I', 'Overweight Level II', 'Obesity Type I', 'Obesity Type II', 'Obesity Type III')
The data set could be found here
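Because the target levels have a natural order, it may help to store them as an ordered categorical so that sorting and plotting follow weight level rather than alphabetical order. The sketch below is one possible way to do this; the helper `as_ordered_levels` and the toy series are hypothetical, and the level names are copied from the description above (the labels in the actual file may be spelled differently, for example with underscores, and would need to be adjusted).

```python
import pandas as pd

# Level names as listed in the description, from lightest to heaviest.
LEVELS = [
    "Insufficient Weight", "Normal Weight",
    "Overweight Level I", "Overweight Level II",
    "Obesity Type I", "Obesity Type II", "Obesity Type III",
]

def as_ordered_levels(series, levels=LEVELS):
    """Convert a column of level names into an ordered categorical."""
    dtype = pd.CategoricalDtype(categories=levels, ordered=True)
    return series.astype(dtype)

# Made-up values, for illustration only.
toy = pd.Series(
    ["Obesity Type I", "Normal Weight", "Insufficient Weight"],
    name="NObeyesdad",
)
ordered = as_ordered_levels(toy)
print(ordered.sort_values().tolist())
```

Any label that is not in `LEVELS` becomes missing after the conversion, which makes spelling mismatches easy to spot with `isnull()`.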
# import our Obesity data set
df = pd.read_csv('/Users/bryton/Desktop/ObesityDataSet/ObesityDataSet.csv')
df.shape
# make sure there are no null values in the data set
df.isnull().sum()
# metadata of the data set
df.info()
There are 9 categorical variables in the data set. We have to investigate these variables and do feature engineering on them later.
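The notebook does not say which encoding it uses later, so the following is only a sketch of one common approach: integer codes for variables with a natural order (such as CAEC) and one-hot columns for unordered ones (such as Gender or MTRANS). The helper `encode_categoricals`, the toy frame, and the chosen ordering of the CAEC levels are assumptions made for illustration.

```python
import pandas as pd

# Made-up rows using category labels from the data set description.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "CAEC": ["Sometimes", "Always", "No"],
    "MTRANS": ["Walking", "Bike", "Automobile"],
})

def encode_categoricals(frame, ordinal_orders, nominal_cols):
    """Integer-code ordered columns, then one-hot encode unordered ones."""
    out = frame.copy()
    for col, order in ordinal_orders.items():
        # Position in `order` becomes the code; unknown labels become -1.
        out[col] = pd.Categorical(out[col], categories=order, ordered=True).codes
    return pd.get_dummies(out, columns=nominal_cols)

encoded = encode_categoricals(
    toy,
    ordinal_orders={"CAEC": ["No", "Sometimes", "Frequently", "Always"]},
    nominal_cols=["Gender", "MTRANS"],
)
print(encoded)
```

Keeping the order explicit in `ordinal_orders` avoids relying on alphabetical order, which would rank 'Always' below 'No'.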
# simple statistics
df.describe(include = 'all')
df.Gender.unique()
df.family_history_with_overweight.unique()
df.FAVC.unique()
df.CAEC.unique()
df.SMOKE.unique()
df.SCC.unique()
df.CALC.unique()
df.MTRANS.unique()
df.NObeyesdad.unique()
# plot the distribution of every column: histogram for numeric, bar chart for categorical
for column in df:
    plt.figure(column)
    plt.title(column)
    if is_numeric_dtype(df[column]):
        #df[column].plot(kind = 'hist')
        sns.histplot(df[column], kde = True, color = 'steelblue')
    elif is_string_dtype(df[column]):
        df[column].value_counts().plot(kind = 'bar', color = 'lightsteelblue', edgecolor = 'gray')
        plt.xticks(rotation = 30)
# divide the data set into categorical and numerical subsets
# categorical list
cat = df[['Gender', 'family_history_with_overweight', 'FAVC', 'CAEC', 'SMOKE', 'SCC',
'CALC', 'MTRANS', 'NObeyesdad']]
cat
# Numerical list
num = df[['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE']]
num
cat_list = []
num_list = []
for column in df:
    if is_numeric_dtype(df[column]):
        num_list.append(column)
    elif is_string_dtype(df[column]):
        cat_list.append(column)
print(cat_list)
print(num_list)
# correlation matrix and the heatmap
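# restrict to the numerical columns; recent pandas versions raise an error on string columns
correlation = df[num_list].corr()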
sns.heatmap(correlation, cmap = 'Blues', annot = True)
plt.title('Correlation Between Numerical Variables')
plt.xticks(rotation = 45)
sns.pairplot(num, height = 2.5)
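# one boxplot per (categorical, numerical) pair; the loop variables are named
# cat_col / num_col so they do not overwrite the cat and num DataFrames above
fig, axes = plt.subplots(len(cat_list), len(num_list), figsize=(180, 160))
for i, cat_col in enumerate(cat_list):
    for j, num_col in enumerate(num_list):
        sns.boxplot(ax = axes[i, j], x = cat_col, y = num_col,
                    data = df, palette = 'Blues')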
# pairplot of the numerical variables, colored by each categorical variable in turn
for hue_cat in cat_list:
    sns.pairplot(df, hue = hue_cat)